Used Packages

Loading libraries:

library("RCurl")
library(dplyr)
library(readr)
library(tidyr)
library(stringr)
library(knitr)
library(kableExtra)
library(mice)
library(VIM)
library(e1071)
library(ggplot2)
library(corrplot)
library(plotly)
library(caret)
library(doMC)
library(xgboost)

A listing of the libraries loaded during code execution can be found in the cell below:

## [1] "minqa, class, rio, rstudioapi, prodlim, lubridate, ranger, xml2, codetools, splines, robustbase, zeallot, jsonlite, nloptr, broom, compiler, httr, backports, assertthat, Matrix, lazyeval, htmltools, tools, gtable, glue, reshape2, Rcpp, carData, cellranger, vctrs, nlme, lmtest, timeDate, xfun, gower, laeken, openxlsx, lme4, rvest, lifecycle, pan, DEoptimR, MASS, zoo, scales, ipred, hms, yaml, curl, rpart, stringi, boot, zip, lava, rlang, pkgconfig, evaluate, purrr, recipes, htmlwidgets, tidyselect, plyr, magrittr, R6, generics, mitml, pillar, haven, foreign, withr, survival, abind, sp, nnet, tibble, crayon, car, jomo, rmarkdown, readxl, forcats, ModelMetrics, vcd, digest, webshot, stats4, munsell, viridisLite"

Loading Data

Data is loaded according to the metadata provided in the params object, which specifies the location of the dataset (both a local path and a URL). The same object also defines which strings should be treated as missing data (passed to read.csv as na.strings).

if(file.exists(params$dataset_name)) {
  dataset <- read.csv(params$dataset_name, na.strings = params$na_chars)
} else if(url.exists(params$dataset_url)){
  download.file(params$dataset_url, params$dataset_name)
  dataset <- read.csv(params$dataset_name, na.strings = params$na_chars)
} else {
  stop("There is no file nor url resource to work with!")
}

Dataset

To see the dimensionality we used the dim function:

dim(dataset)
## [1] 52582    16

There are 52582 rows and 16 columns.

Here are the names of all columns available in the raw dataset:

dataset %>%
  colnames %>%
  print
##  [1] "X"      "length" "cfin1"  "cfin2"  "chel1"  "chel2"  "lcop1" 
##  [8] "lcop2"  "fbar"   "recr"   "cumf"   "totaln" "sst"    "sal"   
## [15] "xmonth" "nao"

A first look at a few rows of the dataset gives a better view of the data.

head(dataset)
##   X length   cfin1   cfin2   chel1    chel2   lcop1    lcop2  fbar   recr
## 1 0   23.0 0.02778 0.27785 2.46875       NA 2.54787 26.35881 0.356 482831
## 2 1   22.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 3 2   25.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 4 3   25.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 5 4   24.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831
## 6 5   22.0 0.02778 0.27785 2.46875 21.43548 2.54787       NA 0.356 482831
##        cumf   totaln      sst      sal xmonth nao
## 1 0.3059879 267380.8 14.30693 35.51234      7 2.8
## 2 0.3059879 267380.8 14.30693 35.51234      7 2.8
## 3 0.3059879 267380.8 14.30693 35.51234      7 2.8
## 4 0.3059879 267380.8 14.30693 35.51234      7 2.8
## 5 0.3059879 267380.8 14.30693 35.51234      7 2.8
## 6 0.3059879 267380.8 14.30693 35.51234      7 2.8

The names of some columns may be ambiguous, so here is a brief explanation:

## Dataset contains the following attributes:
## - length - length of the fished herring [cm]
## - cfin1 - density of Calanus finmarchicus cat. 1 plankton
## - cfin2 - density of Calanus finmarchicus cat. 2 plankton
## - chel1 - density of Calanus helgolandicus cat. 1 plankton
## - chel2 - density of Calanus helgolandicus cat. 2 plankton
## - lcop1 - density of copepods cat. 1
## - lcop2 - density of copepods cat. 2
## - fbar - intensity of fishing in the region [fraction of remaining fry]
## - recr - annual number of herrings
## - cumf - intensity of annual fishing in the region [fraction of remaining fry]
## - totaln - number of fished herrings
## - sst - sea surface temperature [Celsius]
## - sal - salinity [Knudsen ppt]
## - xmonth - number of the fishing month
## - nao - North Atlantic Oscillation [mean sea level pressure]

After that, the dataset can be converted to dplyr’s data-frame object. By selecting only the columns of numeric type and then checking that all column names match those from the raw set, we can make sure that all columns are ready for further work.

herrings <- tbl_df(dataset)
herrings %>% 
  select_if(is.numeric) %>%
  colnames
##  [1] "X"      "length" "cfin1"  "cfin2"  "chel1"  "chel2"  "lcop1" 
##  [8] "lcop2"  "fbar"   "recr"   "cumf"   "totaln" "sst"    "sal"   
## [15] "xmonth" "nao"

It is also required to transform the xmonth column into a categorical variable (a factor), as its values are month numbers.

herrings <- herrings %>%
  mutate(xmonth = as.factor(xmonth))

Missing Values

All columns were extracted correctly and contain only numeric values. However, there are still missing values (NAs) in the data frame. Before deciding what to do with them, it is crucial to know how many rows contain an NA value.

We started with showing the list of columns that have at least one NA value.

cols_with_na = c()
for(col in colnames(herrings)){
  if(any(is.na(herrings[col]))){
    cols_with_na <- cols_with_na %>% append(col)
  }
}
cols_with_na
## [1] "cfin1" "cfin2" "chel1" "chel2" "lcop1" "lcop2" "sst"

There are 7 such columns in herrings dataset.
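The same check can also be done without an explicit loop. A minimal base-R sketch (the toy data frame df stands in for herrings and is an illustrative assumption, not part of the original data):

```r
# Toy data frame standing in for the herrings dataset.
df <- data.frame(
  a = c(1, NA, 3),   # one missing value
  b = c(4, 5, 6),    # complete
  c = c(NA, NA, 9)   # two missing values
)

na_counts    <- colSums(is.na(df))           # number of NAs per column
cols_with_na <- names(df)[na_counts > 0]     # columns with at least one NA
na_share     <- na_counts / nrow(df)         # fraction of missing values

print(cols_with_na)  # "a" "c"
print(na_share)
```

Besides listing the affected columns, this vectorized form gives the per-column share of missing data for free, which is useful for the 5% rule discussed below.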

Before making any decision about missing data it is important to visualize it somehow. It is recommended not to reconstruct a variable if more than 5% of its data is unavailable. It is also worth knowing the missing-data pattern: if it is random, it is a good idea to try imputation methods, but if there seems to be some reason for the data being unavailable, it is better to consider other approaches (a thorough analysis of the data-gathering process is a good place to start).

The following plots contain interesting information about the missing data. They were produced with VIM’s aggr function, applied only to those columns which contain at least one NA value. The left figure, “Histogram of missing data”, shows how many values of each column are missing. It is around 3% of all values for each column, i.e. less than the mentioned 5%, so data reconstruction is worth considering. Moreover, since all columns have around 3% missing data, this is a first hint that the unavailability might be caused by some random process. The right chart, “Missing data pattern”, shows the pattern of the missing data: all combinations of missing values captured in the dataset are presented. Red symbolizes NA, while blue indicates that a correct value is available. The most common combinations are at the very bottom; moving towards the top, the combinations become less and less frequent.

It seems that there are no strong relations between the missing values. The next step is imputation.

just_with_na <- herrings %>% select(cols_with_na)
aggr_plot <- aggr(just_with_na, col = c('navyblue', 'red'),
                  numbers = FALSE, bars = TRUE, sortVars = TRUE,
                  labels = names(just_with_na), cex.axis = .8, gap = 3,
                  ylab = c("Histogram of missing data", "Missing data pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##     lcop1 0.03143661
##     lcop2 0.03025750
##       sst 0.03012438
##     cfin1 0.03006732
##     chel2 0.02959188
##     chel1 0.02957286
##     cfin2 0.02921152

Imputing Missing Data

In order to reconstruct the missing data we used the mice library (Multivariate Imputation by Chained Equations). The seed of the random generator was set to 17. MICE tries to impute values in place of the missing data based on the context of the other attributes. One of the attributes, sal, had to be excluded from this step, as its participation resulted in an error. MICE supports many methods, among which one could choose by first trying their reconstruction results on a smaller subset of the data and comparing them to the ground truth. We omitted that step and went straight to the fairly good pmm imputation method, whose name stands for “Predictive Mean Matching”.

set.seed(17)
imputed <- mice(herrings,
                predictorMatrix = quickpred(herrings, exclude = c('sal')),
                meth = 'pmm', maxit = 1)
## 
##  iter imp variable
##   1   1  cfin1  cfin2  chel1  chel2  lcop1  lcop2  sst
##   1   2  cfin1  cfin2  chel1  chel2  lcop1  lcop2  sst
##   1   3  cfin1  cfin2  chel1  chel2  lcop1  lcop2  sst
##   1   4  cfin1  cfin2  chel1  chel2  lcop1  lcop2  sst
##   1   5  cfin1  cfin2  chel1  chel2  lcop1  lcop2  sst
tmp_data <- complete(imputed,1)
imputed_data = mutate(tmp_data, sal = herrings$sal)

Attribute-wise Analysis

We started studying the attributes by printing their summary, which contains basic statistical metrics. There are no missing values, as we got rid of them earlier.

kable_styling(kable(summary(imputed_data)))
Variable       Min.        1st Qu.     Median      Mean        3rd Qu.     Max.
X                   0        13145       26290       26290       39436       52581
length           19.0         24.0        25.5        25.3        26.5        32.5
cfin1          0.0000       0.0000      0.1111      0.4457      0.3333     37.6667
cfin2          0.0000       0.2778      0.7012      2.0228      1.7936     19.3958
chel1           0.000        2.469       5.750      10.003      11.500      75.000
chel2           5.238       13.427      21.435      21.204      27.193      57.706
lcop1          0.3074       2.5479      7.0000     12.8051     21.2315    115.5833
lcop2           7.849       17.808      24.859      28.422      37.232      68.736
fbar           0.0680       0.2270      0.3320      0.3304      0.4560      0.8490
recr           140515       360061      421391      520366      724151     1565890
cumf          0.06833      0.14809     0.23191     0.22981     0.29803     0.39801
totaln         144137       306068      539558      514973      730351     1015595
sst             12.77        13.60       13.86       13.87       14.16       14.73
sal             35.40        35.51       35.51       35.51       35.52       35.61
nao          -4.89000     -1.89000     0.20000    -0.09236     1.63000     5.08000

xmonth (factor counts): 8: 9920, 10: 7972, 7: 6922, 9: 5714, 6: 4218, 5: 3736, (Other): 14100

The figure below presents the correlation matrix (or rather its upper part). The attribute length is negatively correlated with sea surface temperature, but apart from that it is only weakly connected with the other attributes.

cor_data <- select(imputed_data, -c('xmonth', 'X'))
corrplot(cor(cor_data), type='upper', order='hclust', tl.col='black', tl.srt=55, title="Correlation matrix")

The chart below presents how the length of the herrings was changing over time. The blue line is the effect of the smoothing function; it lets us clearly see the trend present in the data. There is a noticeable breakpoint.

tmp <- imputed_data %>%
          dplyr::slice(seq(1, nrow(imputed_data), 50))

p <- ggplot(tmp, aes(x=X, y=length)) +
    geom_point(shape=1) + stat_smooth()
p <- ggplotly(p)
p

It is standard practice to check the distributions of the data by plotting histograms and measuring their skewness.

for(col in colnames(imputed_data)){
  if(col != "X"){
    plot <- qplot(imputed_data[[col]], xlab=col) + ggtitle("")
    print(plot)
    print(skewness(imputed_data[[col]]))
  }
}
## Skewness of each attribute (histograms omitted here):
## 
##  Variable   Skewness
##  length    -0.09947805
##  cfin1      9.029307
##  cfin2      3.422597
##  chel1      3.231852
##  chel2      0.4729533
##  lcop1      2.450582
##  lcop2      0.8696977
##  fbar       0.4447865
##  recr       1.043071
##  cumf      -0.01135863
##  totaln     0.0624455
##  sst       -0.04888691
##  sal       -0.5374667
##  xmonth     NA (factor; skewness is not defined for categorical data)
##  nao       -0.02178226

There are some variables that should be looked at more closely. For those whose skewness was > 2, a log transform will be used. More analysis could be done on all of the variables, but the authors decided it is enough to further investigate only some of them. Depending on the share of zeros, either a constant will be added before the log transform, or a dummy variable will be introduced that takes the value 1 if the column was zero and 0 otherwise. This way a more suitable distribution can be obtained while keeping the important information carried by zeros in the data.
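The per-variable steps below follow one common pattern, which can be sketched as a single helper function. This is a minimal base-R sketch; the log_transform name, the 5% zero_threshold default, and the toy vector are illustrative assumptions, not part of the original code:

```r
# Sketch: log-transform a skewed, non-negative variable, handling zeros
# either with a dummy indicator (many zeros) or an added constant (few zeros).
log_transform <- function(x, zero_threshold = 0.05) {
  zero_share <- mean(x == 0)
  if (zero_share > zero_threshold) {
    # Many zeros: log-transform only the positive part and flag zeros.
    data.frame(changed = ifelse(x > 0, log(x), 0),
               is_zero = as.numeric(x == 0))
  } else {
    # Few zeros: a log transform with an added constant is enough.
    data.frame(changed = log(x + 1))
  }
}

# Toy example: 40% zeros, so the dummy-variable branch is taken.
x <- c(0, 0, 1, 10, 100)
out <- log_transform(x)
```

In this sketch a variable with many zeros yields two columns (the transformed values plus the zero indicator), while a variable with few zeros yields a single log(x + 1) column, mirroring the choices made for cfin1/cfin2 versus chel1/lcop1 below.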

number_of_zeros = NROW(imputed_data$cfin1[imputed_data$cfin1== 0])

For the variable cfin1, a log transform was used and a dummy variable introduced, as 28.04% of the values in this column were zeros.

  print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros:  14745"
  imputed_data$cfin1Changed[imputed_data$cfin1 > 0] <- log(imputed_data$cfin1[imputed_data$cfin1 > 0])
  imputed_data$cfin1Changed[imputed_data$cfin1 <= 0] <- 0
  plot <- qplot(imputed_data$cfin1Changed[imputed_data$cfin1 > 0], xlab="cfin1", binwidth=0.05)
  print(plot)

  imputed_data$cfin1Zero <- as.numeric(imputed_data$cfin1 == 0)
  print(qplot(imputed_data$cfin1Zero,xlab="cfin1Zero"))

number_of_zeros = NROW(imputed_data$cfin2[imputed_data$cfin2== 0])

The same technique was used for the cfin2 variable, as 7.46% of its records were still zero values.

  print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros:  3925"
  imputed_data$cfin2Changed[imputed_data$cfin2 > 0] <- log(imputed_data$cfin2[imputed_data$cfin2 > 0])
  imputed_data$cfin2Changed[imputed_data$cfin2 <= 0] <- 0
  plot <- qplot(imputed_data$cfin2Changed[imputed_data$cfin2 > 0], xlab="cfin2", binwidth=0.05)
  print(plot)

  imputed_data$cfin2Zero <- as.numeric(imputed_data$cfin2 == 0)
  print(qplot(imputed_data$cfin2Zero,xlab="cfin2Zero"))

number_of_zeros = NROW(imputed_data$chel1[imputed_data$chel1== 0])

For ‘chel1’ there were 3.66% zero values (< 5%), which is why a log transform with an added constant was preferred. The 5% value is not a rule but rather a threshold the authors decided on. A more appropriate solution could be to test all possible variable setups.

  print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros:  1924"
  imputed_data$chel1Changed <- log(imputed_data$chel1 +1)
  
  plot <- qplot(imputed_data$chel1Changed, xlab="chel1", binwidth=0.05)
  print(plot)

number_of_zeros = NROW(imputed_data$lcop1[imputed_data$lcop1== 0])

“lcop1” was only log-transformed with an added constant, as it contains no zero values at all.

  print(paste("Number of zeros: ", number_of_zeros))
## [1] "Number of zeros:  0"
  imputed_data$lcop1Changed <- log(imputed_data$lcop1 +1)
  plot <- qplot(imputed_data$lcop1Changed, xlab="lcop1", binwidth=0.05)
  print(plot)

Regression

First we start by turning xmonth into dummy variables. There is now a new column for each month, containing one if the sample was taken in the month specified in the column name and zero otherwise.

reg_data <- model.matrix(X ~ ., imputed_data)
head(reg_data)
##   (Intercept) length   cfin1   cfin2   chel1    chel2   lcop1    lcop2
## 1           1   23.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 2           1   22.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 3           1   25.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 4           1   25.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 5           1   24.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
## 6           1   22.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881
##    fbar   recr      cumf   totaln      sst      sal xmonth2 xmonth3
## 1 0.356 482831 0.3059879 267380.8 14.30693 35.51234       0       0
## 2 0.356 482831 0.3059879 267380.8 14.30693 35.51234       0       0
## 3 0.356 482831 0.3059879 267380.8 14.30693 35.51234       0       0
## 4 0.356 482831 0.3059879 267380.8 14.30693 35.51234       0       0
## 5 0.356 482831 0.3059879 267380.8 14.30693 35.51234       0       0
## 6 0.356 482831 0.3059879 267380.8 14.30693 35.51234       0       0
##   xmonth4 xmonth5 xmonth6 xmonth7 xmonth8 xmonth9 xmonth10 xmonth11
## 1       0       0       0       1       0       0        0        0
## 2       0       0       0       1       0       0        0        0
## 3       0       0       0       1       0       0        0        0
## 4       0       0       0       1       0       0        0        0
## 5       0       0       0       1       0       0        0        0
## 6       0       0       0       1       0       0        0        0
##   xmonth12 nao cfin1Changed cfin1Zero cfin2Changed cfin2Zero chel1Changed
## 1        0 2.8    -3.583439         0    -1.280674         0     1.243794
## 2        0 2.8    -3.583439         0    -1.280674         0     1.243794
## 3        0 2.8    -3.583439         0    -1.280674         0     1.243794
## 4        0 2.8    -3.583439         0    -1.280674         0     1.243794
## 5        0 2.8    -3.583439         0    -1.280674         0     1.243794
## 6        0 2.8    -3.583439         0    -1.280674         0     1.243794
##   lcop1Changed
## 1     1.266347
## 2     1.266347
## 3     1.266347
## 4     1.266347
## 5     1.266347
## 6     1.266347

Some attributes in this dataset have values measured in tens of thousands, while others never exceed one. If the data stayed in that form, some attributes would contribute greatly to the final result while others would be ignored. To make sure all attributes participate equally in the learning process, the data has to be normalized in one way or another. We decided to use the simple scaling and centering methods from the caret library. Before that, the data has to be divided into training and test sets: the normalization model must be fitted on the training set only, as fitting it on the whole dataset would introduce a data leak.
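The leak-free order of operations (split first, then fit the normalization on the training rows only) can be sketched in base R as follows; the toy matrix, the 80/20 ratio, and all variable names here are illustrative assumptions rather than the report's actual pipeline:

```r
# Sketch: fit centering/scaling statistics on the training rows only,
# then apply the SAME statistics to the test rows (no data leak).
set.seed(17)
m <- matrix(rnorm(200, mean = 5, sd = 3), nrow = 50)  # toy numeric data

train_idx <- sample(nrow(m), size = 0.8 * nrow(m))    # 80/20 split
train <- m[train_idx, ]
test  <- m[-train_idx, ]

mu    <- colMeans(train)          # statistics come from training data only
sigma <- apply(train, 2, sd)

train_scaled <- scale(train, center = mu, scale = sigma)
test_scaled  <- scale(test,  center = mu, scale = sigma)
```

With caret, passing preProcess to train achieves the same effect automatically, since the preprocessing is refitted inside each cross-validation fold.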

ctrl <- trainControl(method = "cv", number = 8, preProcOptions = c(pcaComp = 32), allowParallel = TRUE)
lm <- train(length ~ ., data = reg_data, method = 'lm', trControl = ctrl, preProcess = c('center', 'scale', 'pca'))
lm
## Linear Regression 
## 
## 52582 samples
##    31 predictor
## 
## Pre-processing: centered (30), scaled (30), principal component
##  signal extraction (30) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 46010, 46009, 46009, 46010, 46009, 46008, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   1.322786  0.359529  1.053329
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
registerDoMC(cores=4)
ctrl <- trainControl(method = "cv", number = 8, allowParallel = TRUE)
grid <- expand.grid(max_depth=c(3,4,5), 
                    subsample=c(1.0), 
                    nrounds=c(200),
                    eta=c(0.4, 0.3),
                    gamma=c(0.001,0.01),
                    colsample_bytree=c(0.7),
                    min_child_weight=c(1.0, 0.5))
xgb_tree <- train(length~., data = reg_data, method='xgbTree', trControl=ctrl, tuneGrid=grid, preProcess=c('center', 'scale'))
xgb_tree
## eXtreme Gradient Boosting 
## 
## 52582 samples
##    31 predictor
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 46010, 46010, 46010, 46009, 46009, 46008, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  gamma  min_child_weight  RMSE      Rsquared   MAE      
##   0.3  3          0.001  0.5               1.143325  0.5215935  0.9005966
##   0.3  3          0.001  1.0               1.142940  0.5219241  0.9004551
##   0.3  3          0.010  0.5               1.143485  0.5214670  0.9006081
##   0.3  3          0.010  1.0               1.143756  0.5212325  0.9009772
##   0.3  4          0.001  0.5               1.138971  0.5252179  0.8962361
##   0.3  4          0.001  1.0               1.139873  0.5244800  0.8971574
##   0.3  4          0.010  0.5               1.139346  0.5249190  0.8968028
##   0.3  4          0.010  1.0               1.139225  0.5250204  0.8963911
##   0.3  5          0.001  0.5               1.139983  0.5244590  0.8965880
##   0.3  5          0.001  1.0               1.140276  0.5242202  0.8969980
##   0.3  5          0.010  0.5               1.139538  0.5248331  0.8961700
##   0.3  5          0.010  1.0               1.139754  0.5246495  0.8963924
##   0.4  3          0.001  0.5               1.142713  0.5220969  0.9000945
##   0.4  3          0.001  1.0               1.142264  0.5224841  0.8995259
##   0.4  3          0.010  0.5               1.141527  0.5231022  0.8986428
##   0.4  3          0.010  1.0               1.141946  0.5227398  0.8987036
##   0.4  4          0.001  0.5               1.139576  0.5247556  0.8965799
##   0.4  4          0.001  1.0               1.140238  0.5242225  0.8967870
##   0.4  4          0.010  0.5               1.140291  0.5241734  0.8969215
##   0.4  4          0.010  1.0               1.140240  0.5242122  0.8969085
##   0.4  5          0.001  0.5               1.141842  0.5229918  0.8974913
##   0.4  5          0.001  1.0               1.141842  0.5230002  0.8976417
##   0.4  5          0.010  0.5               1.141364  0.5233983  0.8972330
##   0.4  5          0.010  1.0               1.141777  0.5230539  0.8975537
## 
## Tuning parameter 'nrounds' was held constant at a value of 200
## 
## Tuning parameter 'colsample_bytree' was held constant at a value of
##  0.7
## Tuning parameter 'subsample' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 200, max_depth = 4,
##  eta = 0.3, gamma = 0.001, colsample_bytree = 0.7, min_child_weight =
##  0.5 and subsample = 1.